Methods for analyzing nucleic acids

ABSTRACT

Aspects of the present invention include analyzing nucleic acids from single cells using methods that include using tagged polynucleotides containing multiplex identifier sequences.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 17/129,684, filed Dec. 21, 2020, which is a continuation of U.S. application Ser. No. 16/984,034, filed Aug. 3, 2020, which is a continuation of U.S. application Ser. No. 15/930,333, filed May 12, 2020, now U.S. Pat. No. 10,767,223, which is a continuation of U.S. application Ser. No. 16/817,461, filed Mar. 12, 2020, now U.S. Pat. No. 10,697,013, which is a continuation of U.S. application Ser. No. 16/443,703, filed Jun. 17, 2019, now U.S. Pat. No. 10,633,702, which is a continuation of U.S. application Ser. No. 16/397,832, filed Apr. 29, 2019, now U.S. Pat. No. 10,392,662, which is a continuation of U.S. application Ser. No. 16/282,188, filed Feb. 21, 2019, now U.S. Pat. No. 10,337,063, which is a continuation of U.S. application Ser. No. 16/261,268, filed Jan. 29, 2019, now U.S. Pat. No. 10,280,459, which is a continuation of U.S. application Ser. No. 16/194,047, filed Nov. 16, 2018, now U.S. Pat. No. 10,240,197, which is a continuation of U.S. application Ser. No. 15/677,957, filed Aug. 15, 2017, now U.S. Pat. No. 10,155,981, which is a continuation of U.S. application Ser. No. 14/792,094, filed Jul. 6, 2015, which is a continuation of U.S. application Ser. No. 14/172,694, filed Feb. 4, 2014, now U.S. Pat. No. 9,102,980, which is a continuation of U.S. application Ser. No. 14/021,790, filed Sep. 9, 2013, now U.S. Pat. No. 8,679,756, which is a continuation of U.S. application Ser. No. 13/859,450, filed Apr. 9, 2013, now U.S. Pat. No. 8,563,274, which is a continuation of U.S. application Ser. No. 13/622,872, filed Sep. 19, 2012, which is a continuation of U.S. application Ser. No. 13/387,343, filed Feb. 15, 2012, now U.S. Pat. No. 8,298,767, which is a § 371 National Phase Application of PCT/IB2010/002243, filed Aug. 13, 2010, which claims priority to U.S. Provisional Application No. 61/235,595, filed Aug. 20, 2009 and U.S. Provisional Application No. 61/288,792, filed Dec. 21, 2009; all of which are herein incorporated by reference.

BACKGROUND OF THE INVENTION

We have previously described methods that enable tagging each of a population of fragmented genomes and then combining them together to create a ‘population library’ that can be processed and eventually sequenced as a mixture. The population tags enable analysis software to parse the sequence reads into files that can be attributed to a particular genome in the population. One limitation of the overall process stems from limitations of existing DNA sequencing technologies. In particular, if fragments in the regions of interest of the genome are longer than the lengths that can be sequenced by a particular technology, then such fragments will not be fully analyzed (since sequencing proceeds from an end of a fragment inward). Furthermore, a disadvantage of any sequencing technology dependent on fragmentation is that sequence changes in one part of a particular genomic region may not be able to be linked to sequence changes in other parts of the same genome (e.g., the same chromosome) because the sequence changes reside on different fragments. (See FIG. 5 and its description below).

The present invention removes the limitations imposed by current sequencing technologies and is also useful in a number of other nucleic acid analyses.

SUMMARY OF THE INVENTION

Aspects of the present invention are drawn to processes for moving a region of interest in a polynucleotide from a first position to a second position with regard to a domain within the polynucleotide, also referred to as a “reflex method” (or reflex process, reflex sequence process, reflex reaction, and the like). In certain embodiments, the reflex method results in moving a region of interest into functional proximity to specific domain elements present in the polynucleotide (e.g., primer sites and/or MID). Compositions, kits and systems that find use in carrying out the reflex processes described herein are also provided.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to scale. Indeed, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures:

FIG. 1: Panel A is a schematic diagram illustrating moving a first domain from one site to another in a nucleic acid molecule using a reflex sequence. Panel B is a schematic diagram depicting the relative position of primer pairs (A_(n)-B_(n) primers) that find use in aspects of the reflex process described herein.

FIG. 2 shows an exemplary embodiment of using binding partner pairs (biotin/streptavidin) to isolate single stranded polynucleotides of interest.

FIG. 3 is a schematic diagram illustrating an exemplary embodiment for moving a primer site and a MID to a specific location in a nucleic acid of interest.

FIG. 4 shows a schematic diagram illustrating an exemplary use of the reflex process for generating a sample enriched for fragments having a region of interest (e.g., from a population of randomly fragmented and asymmetrically tagged polynucleotides).

FIG. 5 shows a comparison of methods for identifying nucleic acid polymorphisms in homologous nucleic acids in a sample (e.g., the same region derived from a chromosomal pair of a diploid cell or viral genomes/transcripts). The top schematic shows two nucleic acid molecules in a sample (1 and 2) having a different assortment of polymorphisms in polymorphic sites A, B and C (A1, B1, C1 and C2). Standard sequencing methods using fragmentation (left side) can identify the polymorphisms in these nucleic acids but do not retain linkage information. Employing the reflex process described herein to identify polymorphisms (right side) maintains linkage information.

FIG. 6: Panel A is a schematic showing expected structures and sizes of nucleic acid species in the reflex process; Panel B is a polyacrylamide gel showing the nucleic acid species produced in the reflex process described in Example 1.

FIG. 7: Panel A is a schematic showing the structure of the nucleic acid and competitor used in the reflex process; Panel B is a polyacrylamide gel showing the nucleic acid species produced in the reflex process described in Example 1.

FIG. 8 shows a flow chart of a reflex process (left) in which the T7 exonuclease step is optional. The gel on the right shows the resultant product of the reflex process either without the T7 exonuclease step (lane 1) or with the T7 exonuclease step (lane 2).

FIG. 9 shows an exemplary reflex process workflow with indications on the right as to where purification of reaction products is employed (e.g., using Agencourt beads to remove primer oligos).

FIG. 10 shows the starting material (left panel) and the resultant product generated (right panel) using a reflex process without using a T7 exonuclease step (as described in Example II). The reflex site in the starting material is a sequence normally present in the polynucleotide being processed (also called a “non-artificial” reflex site). This figure shows that the 755 base pair starting nucleic acid was processed to the expected 461 base pair product, thus confirming that a “non-artificial” reflex site is effective in transferring an adapter domain from one location to another in a polynucleotide of interest in a sequence specific manner.

FIG. 11 shows a schematic and results of an experiment in which the reflex process is performed on a single large initial template (a “parent” fragment) to generate five different products (“daughter” products) each having a different region of interest (i.e., daughter products are produced having either region 1, 2, 3, 4 or 5).

FIG. 12 shows a schematic and results of experiments performed to determine the prevalence of intramolecular rearrangement during the reflex process (as desired) vs. intermolecular rearrangement (MID switching).

FIG. 13 shows a diagram of exemplary workflows for preparing material for and performing the reflex process.

DEFINITIONS

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Still, certain elements are defined for the sake of clarity and ease of reference.

Terms and symbols of nucleic acid chemistry, biochemistry, genetics, and molecular biology used herein follow those of standard treatises and texts in the field, e.g. Kornberg and Baker, DNA Replication, Second Edition (W.H. Freeman, New York, 1992); Lehninger, Biochemistry, Second Edition (Worth Publishers, New York, 1975); Strachan and Read, Human Molecular Genetics, Second Edition (Wiley-Liss, New York, 1999); Eckstein, editor, Oligonucleotides and Analogs: A Practical Approach (Oxford University Press, New York, 1991); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press, Oxford, 1984); and the like.

“Amplicon” means the product of a polynucleotide amplification reaction. That is, it is a population of polynucleotides, usually double stranded, that are replicated from one or more starting sequences. The one or more starting sequences may be one or more copies of the same sequence, or it may be a mixture of different sequences. Amplicons may be produced by a variety of amplification reactions whose products are multiple replicates of one or more target nucleic acids. Generally, amplification reactions producing amplicons are “template-driven” in that base pairing of reactants, either nucleotides or oligonucleotides, have complements in a template polynucleotide that are required for the creation of reaction products. In one aspect, template-driven reactions are primer extensions with a nucleic acid polymerase or oligonucleotide ligations with a nucleic acid ligase. Such reactions include, but are not limited to, polymerase chain reactions (PCRs), linear polymerase reactions, nucleic acid sequence-based amplification (NASBAs), rolling circle amplifications, and the like, disclosed in the following references that are incorporated herein by reference: Mullis et al, U.S. Pat. Nos. 4,683,195; 4,965,188; 4,683,202; 4,800,159 (PCR); Gelfand et al, U.S. Pat. No. 5,210,015 (real-time PCR with “TAQMAN™” probes); Wittwer et al, U.S. Pat. No. 6,174,670; Kacian et al, U.S. Pat. No. 5,399,491 (“NASBA”); Lizardi, U.S. Pat. No. 5,854,033; Aono et al, Japanese patent publ. JP 4-262799 (rolling circle amplification); and the like. In one aspect, amplicons of the invention are produced by PCRs. An amplification reaction may be a “real-time” amplification if a detection chemistry is available that permits a reaction product to be measured as the amplification reaction progresses, e.g. “real-time PCR” described below, or “real-time NASBA” as described in Leone et al, Nucleic Acids Research, 26: 2150-2155 (1998), and like references. As used herein, the term “amplifying” means performing an amplification reaction. A “reaction mixture” means a solution containing all the necessary reactants for performing a reaction, which may include, but not be limited to, buffering agents to maintain pH at a selected level during a reaction, salts, co-factors, scavengers, and the like.

The term “assessing” includes any form of measurement, and includes determining if an element is present or not. The terms “determining”, “measuring”, “evaluating”, “assessing” and “assaying” are used interchangeably and include quantitative and qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, and/or determining whether it is present or absent.

Polynucleotides that are “asymmetrically tagged” have left and right adapter domains that are not identical. This process is referred to generically as attaching adapters asymmetrically or asymmetrically tagging a polynucleotide, e.g., a polynucleotide fragment. Production of polynucleotides having asymmetric adapter termini may be achieved in any convenient manner.

As one example, a user of the subject invention may use an asymmetric adapter to tag polynucleotides. An “asymmetric adapter” is one that, when ligated to both ends of a double stranded nucleic acid fragment, will lead to the production of primer extension or amplification products that have non-identical sequences flanking the genomic insert of interest. The ligation is usually followed by subsequent processing steps so as to generate the non-identical terminal adapter sequences. For example, replication of an asymmetric adapter attached fragment(s) results in polynucleotide products in which there is at least one nucleic acid sequence difference, or nucleotide/nucleoside modification, between the terminal adapter sequences. Attaching adapters asymmetrically to polynucleotides (e.g., polynucleotide fragments) results in polynucleotides that have one or more adapter sequences on one end (e.g., one or more region or domain, e.g., a primer site) that are either not present or have a different nucleic acid sequence as compared to the adapter sequence on the other end. It is noted that an adapter that is termed an “asymmetric adapter” is not necessarily itself structurally asymmetric, nor does the mere act of attaching an asymmetric adapter to a polynucleotide fragment render it immediately asymmetric. Rather, an asymmetric adapter-attached polynucleotide, which has an identical asymmetric adapter at each end, produces replication products (or isolated single stranded polynucleotides) that are asymmetric with respect to the adapter sequences on opposite ends (e.g., after at least one round of amplification/primer extension).

Any convenient asymmetric adapter, or process for attaching adapters asymmetrically, may be employed in practicing the present invention. Exemplary asymmetric adapters are described in: U.S. Pat. Nos. 5,712,126 and 6,372,434; U.S. Patent Publications 2007/0128624 and 2007/0172839; and PCT publication WO/2009/032167; all of which are incorporated by reference herein in their entirety. In certain embodiments, the asymmetric adapters employed are those described in U.S. patent application Ser. No. 12/432,080, filed on Apr. 29, 2009, incorporated herein by reference in its entirety.

“Complementary” or “substantially complementary” refers to the hybridization or base pairing or the formation of a duplex between nucleotides or nucleic acids, such as, for instance, between the two strands of a double stranded DNA molecule or between an oligonucleotide primer and a primer site on a single stranded nucleic acid. Complementary nucleotides are, generally, A and T (or A and U), or C and G. Two single stranded RNA or DNA molecules are said to be substantially complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with at least about 80% of the nucleotides of the other strand, usually at least about 90% to 95%, and more preferably from about 98 to 100%. Alternatively, substantial complementarity exists when an RNA or DNA strand will hybridize under selective hybridization conditions to its complement. Typically, selective hybridization will occur when there is at least about 65% complementarity over a stretch of at least 14 to 25 nucleotides, preferably at least about 75%, more preferably at least about 90% complementarity. See, M. Kanehisa Nucleic Acids Res. 12:203 (1984), incorporated herein by reference.
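By way of illustration only, the percentage thresholds given above can be checked computationally. The following Python sketch is an assumption-laden simplification: it compares two equal-length strands in antiparallel orientation without gaps, whereas the definition above also permits nucleotide insertions or deletions in an optimal alignment. The function names and the example sequences are hypothetical and are not part of the described methods.

```python
# Watson-Crick partners for DNA; U is treated as T before lookup.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a sequence written 5'->3'."""
    normalized = seq.upper().replace("U", "T")
    return "".join(COMPLEMENT[base] for base in reversed(normalized))


def percent_complementarity(strand_a, strand_b):
    """Percent of positions that pair when two equal-length strands,
    each written 5'->3', are aligned antiparallel with no gaps.
    (Simplification: insertions and deletions are not considered.)"""
    if len(strand_a) != len(strand_b) or not strand_a:
        raise ValueError("strands must be non-empty and of equal length")
    a = strand_a.upper().replace("U", "T")
    b_reversed = strand_b.upper().replace("U", "T")[::-1]
    paired = sum(1 for x, y in zip(a, b_reversed) if COMPLEMENT.get(x) == y)
    return 100.0 * paired / len(a)


def is_substantially_complementary(strand_a, strand_b, threshold=80.0):
    """Apply the 'at least about 80%' criterion as a hard cutoff."""
    return percent_complementarity(strand_a, strand_b) >= threshold


if __name__ == "__main__":
    perfect = reverse_complement("ATGCCTG")
    print(perfect, percent_complementarity("ATGCCTG", perfect))
    # One mismatched position out of seven still exceeds the 80% cutoff.
    print(is_substantially_complementary("ATGCCTG", "CAGGCAA"))
```

The hard 80% cutoff is a design convenience of this sketch; the text itself describes the thresholds as approximate ("at least about").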

“Duplex” means at least two oligonucleotides and/or polynucleotides that are fully or partially complementary and undergo Watson-Crick type base pairing among all or most of their nucleotides so that a stable complex is formed. The terms “annealing” and “hybridization” are used interchangeably to mean the formation of a stable duplex. “Perfectly matched” in reference to a duplex means that the poly- or oligonucleotide strands making up the duplex form a double stranded structure with one another such that every nucleotide in each strand undergoes Watson-Crick base pairing with a nucleotide in the other strand. A stable duplex can include Watson-Crick base pairing and/or non-Watson-Crick base pairing between the strands of the duplex (where base pairing means the forming of hydrogen bonds). In certain embodiments, a non-Watson-Crick base pair includes a nucleoside analog, such as deoxyinosine, 2,6-diaminopurine, PNAs, LNAs and the like. In certain embodiments, a non-Watson-Crick base pair includes a “wobble base”, such as deoxyinosine, 8-oxo-dA, 8-oxo-dG and the like, where by “wobble base” is meant a nucleic acid base that can base pair with a first nucleotide base in a complementary nucleic acid strand but that, when employed as a template strand for nucleic acid synthesis, leads to the incorporation of a second, different nucleotide base into the synthesizing strand (wobble bases are described in further detail below). A “mismatch” in a duplex between two oligonucleotides or polynucleotides means that a pair of nucleotides in the duplex fails to undergo Watson-Crick bonding.

“Genetic locus,” “locus,” or “locus of interest” in reference to a genome or target polynucleotide, means a contiguous sub-region or segment of the genome or target polynucleotide. As used herein, genetic locus, locus, or locus of interest may refer to the position of a nucleotide, a gene or a portion of a gene in a genome, including mitochondrial DNA or other non-chromosomal DNA (e.g., bacterial plasmid), or it may refer to any contiguous portion of genomic sequence whether or not it is within, or associated with, a gene. A genetic locus, locus, or locus of interest can be from a single nucleotide to a segment of a few hundred or a few thousand nucleotides in length or more. In general, a locus of interest will have a reference sequence associated with it (see description of “reference sequence” below).

“Kit” refers to any delivery system for delivering materials or reagents for carrying out a method of the invention. In the context of reaction assays, such delivery systems include systems that allow for the storage, transport, or delivery of reaction reagents (e.g., probes, enzymes, etc. in the appropriate containers) and/or supporting materials (e.g., buffers, written instructions for performing the assay etc.) from one location to another. For example, kits include one or more enclosures (e.g., boxes) containing the relevant reaction reagents and/or supporting materials. Such contents may be delivered to the intended recipient together or separately. For example, a first container may contain an enzyme for use in an assay, while a second container contains probes.

“Ligation” means to form a covalent bond or linkage between the termini of two or more nucleic acids, e.g. oligonucleotides and/or polynucleotides, in a template-driven reaction. The nature of the bond or linkage may vary widely and the ligation may be carried out enzymatically or chemically. As used herein, ligations are usually carried out enzymatically to form a phosphodiester linkage between a 5′ carbon of a terminal nucleotide of one oligonucleotide and a 3′ carbon of another oligonucleotide. A variety of template-driven ligation reactions are described in the following references, which are incorporated by reference: Whiteley et al, U.S. Pat. No. 4,883,750; Letsinger et al, U.S. Pat. No. 5,476,930; Fung et al, U.S. Pat. No. 5,593,826; Kool, U.S. Pat. No. 5,426,180; Landegren et al, U.S. Pat. No. 5,871,921; Xu and Kool, Nucleic Acids Research, 27: 875-881 (1999); Higgins et al, Methods in Enzymology, 68: 50-71 (1979); Engler et al, The Enzymes, 15: 3-29 (1982); and Namsaraev, U.S. patent publication 2004/0110213.

“Multiplex Identifier” (MID) as used herein refers to a tag or combination of tags associated with a polynucleotide whose identity (e.g., the tag DNA sequence) can be used to differentiate polynucleotides in a sample. In certain embodiments, the MID on a polynucleotide is used to identify the source from which the polynucleotide is derived. For example, a nucleic acid sample may be a pool of polynucleotides derived from different sources, (e.g., polynucleotides derived from different individuals, different tissues or cells, or polynucleotides isolated at different time points), where the polynucleotides from each different source are tagged with a unique MID. As such, a MID provides a correlation between a polynucleotide and its source. In certain embodiments, MIDs are employed to uniquely tag each individual polynucleotide in a sample. Identification of the number of unique MIDs in a sample can provide a readout of how many individual polynucleotides are present in the sample (or from how many original polynucleotides a manipulated polynucleotide sample was derived; see, e.g., U.S. Pat. No. 7,537,897, issued on May 26, 2009, incorporated herein by reference in its entirety). MIDs can range in length from 2 to 100 nucleotide bases or more and may include multiple subunits, where each different MID has a distinct identity and/or order of subunits. Exemplary nucleic acid tags that find use as MIDs are described in U.S. Pat. No. 7,544,473, issued on Jun. 6, 2009, and titled “Nucleic Acid Analysis Using Sequence Tokens”, as well as U.S. Pat. No. 7,393,665, issued on Jul. 1, 2008, and titled “Methods and Compositions for Tagging and Identifying Polynucleotides”, both of which are incorporated herein by reference in their entirety for their description of nucleic acid tags and their use in identifying polynucleotides. In certain embodiments, a set of MIDs employed to tag a plurality of samples need not have any particular common property (e.g., Tm, length, base composition, etc.), as the methods described herein can accommodate a wide variety of unique MID sets. It is emphasized here that MIDs need only be unique within a given experiment. Thus, the same MID may be used to tag a different sample being processed in a different experiment. In addition, in certain experiments, a user may use the same MID to tag a subset of different samples within the same experiment. For example, all samples derived from individuals having a specific phenotype may be tagged with the same MID, e.g., all samples derived from control (or wildtype) subjects can be tagged with a first MID while subjects having a disease condition can be tagged with a second MID (different than the first MID). As another example, it may be desirable to tag different samples derived from the same source with different MIDs (e.g., samples derived over time or derived from different sites within a tissue). Further, MIDs can be generated in a variety of different ways, e.g., by a combinatorial tagging approach in which one MID is attached by ligation and a second MID is attached by primer extension. Thus, MIDs can be designed and implemented in a variety of different ways to track polynucleotide fragments during processing and analysis, and thus no limitation in this regard is intended.
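As a purely illustrative sketch of how a MID can correlate a polynucleotide with its source, the Python code below sorts sequence reads into per-source bins and counts the distinct MIDs observed. It rests on assumptions that are not requirements of the methods described herein: each MID is taken to be a fixed-length prefix at the start of the read, matching is exact (no tolerance for sequencing errors), and the MID sequences, source labels, and function names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, mid_to_source, mid_length):
    """Group reads by source using a fixed-length MID read prefix.

    Assumes the MID occupies the first `mid_length` bases of each read
    and must match a known MID exactly; other reads are 'unassigned'.
    """
    bins = defaultdict(list)
    for read in reads:
        mid, insert = read[:mid_length], read[mid_length:]
        source = mid_to_source.get(mid, "unassigned")
        bins[source].append(insert)
    return dict(bins)


def count_unique_mids(reads, mid_length):
    """Number of distinct MID prefixes observed among the reads."""
    return len({read[:mid_length] for read in reads})


if __name__ == "__main__":
    # Hypothetical 4-base MIDs: one shared by control samples and one
    # shared by disease samples, as in the phenotype example above.
    mids = {"ACGT": "control", "TGCA": "disease"}
    reads = ["ACGTGGATCC", "TGCATTTAAA", "ACGTCCCGGG", "GGGGAAAATT"]
    for source, inserts in demultiplex(reads, mids, 4).items():
        print(source, inserts)
    print("distinct MIDs observed:", count_unique_mids(reads, 4))
```

When MIDs are instead used to tag each individual polynucleotide uniquely, the count of distinct MIDs serves as the readout of how many original molecules were present; a practical implementation would likely need to tolerate sequencing errors within the MID, which this sketch does not attempt.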

“Nucleoside” as used herein includes the natural nucleosides, including 2′-deoxy and 2′-hydroxyl forms, e.g. as described in Kornberg and Baker, DNA Replication, 2nd Ed. (Freeman, San Francisco, 1992). “Analogs” in reference to nucleosides includes synthetic nucleosides having modified base moieties and/or modified sugar moieties, e.g. described by Scheit, Nucleotide Analogs (John Wiley, New York, 1980); Uhlman and Peyman, Chemical Reviews, 90: 543-584 (1990), or the like, with the proviso that they are capable of specific hybridization. Such analogs include synthetic nucleosides designed to enhance binding properties, reduce complexity, increase specificity, and the like. Polynucleotides comprising analogs with enhanced hybridization or nuclease resistance properties are described in Uhlman and Peyman (cited above); Crooke et al, Exp. Opin. Ther. Patents, 6: 855-870 (1996); Mesmaeker et al, Current Opinion in Structural Biology, 5: 343-355 (1995); and the like. Exemplary types of polynucleotides that are capable of enhancing duplex stability include oligonucleotide N3′→P5′ phosphoramidates (referred to herein as “amidates”), peptide nucleic acids (referred to herein as “PNAs”), oligo-2′-O-alkylribonucleotides, polynucleotides containing C-5 propynylpyrimidines, locked nucleic acids (“LNAs”), and like compounds. Such oligonucleotides are either available commercially or may be synthesized using methods described in the literature.

“Polymerase chain reaction,” or “PCR,” means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. In other words, PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates. Usually, the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g. exemplified by the references: McPherson et al, editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively). For example, in a conventional PCR using Taq DNA polymerase, a double stranded target nucleic acid may be denatured at a temperature >90° C., primers annealed at a temperature in the range 50-75° C., and primers extended at a temperature in the range 72-78° C. The term “PCR” encompasses derivative forms of the reaction, including but not limited to, RT-PCR, real-time PCR, nested PCR, quantitative PCR, multiplexed PCR, and the like. Reaction volumes range from a few hundred nanoliters, e.g. 200 nL, to a few hundred μL, e.g. 200 μL. “Reverse transcription PCR,” or “RT-PCR,” means a PCR that is preceded by a reverse transcription reaction that converts a target RNA to a complementary single stranded DNA, which is then amplified, e.g. Tecott et al, U.S. Pat. No. 5,168,038, which patent is incorporated herein by reference. “Real-time PCR” means a PCR for which the amount of reaction product, i.e. amplicon, is monitored as the reaction proceeds. There are many forms of real-time PCR that differ mainly in the detection chemistries used for monitoring the reaction product, e.g. Gelfand et al, U.S. Pat. No. 5,210,015 (“TAQMAN™”); Wittwer et al, U.S. Pat. Nos. 6,174,670 and 6,569,627 (intercalating dyes); Tyagi et al, U.S. Pat. No. 5,925,517 (molecular beacons); which patents are incorporated herein by reference. Detection chemistries for real-time PCR are reviewed in Mackay et al, Nucleic Acids Research, 30: 1292-1305 (2002), which is also incorporated herein by reference. “Nested PCR” means a two-stage PCR wherein the amplicon of a first PCR becomes the sample for a second PCR using a new set of primers, at least one of which binds to an interior location of the first amplicon. As used herein, “initial primers” in reference to a nested amplification reaction mean the primers used to generate a first amplicon, and “secondary primers” mean the one or more primers used to generate a second, or nested, amplicon. “Multiplexed PCR” means a PCR wherein multiple target sequences (or a single target sequence and one or more reference sequences) are simultaneously amplified in the same reaction mixture, e.g. Bernard et al, Anal. Biochem., 273: 221-228 (1999) (two-color real-time PCR). Usually, distinct sets of primers are employed for each sequence being amplified.
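To give a sense of scale for the repeated denature/anneal/extend cycle described above, the following Python sketch applies the standard idealized model in which each cycle multiplies the number of target copies by (1 + efficiency). This model and the `efficiency` parameter are textbook assumptions introduced here for illustration; they are not taken from the present disclosure, and real reactions plateau as reagents are consumed.

```python
def expected_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized PCR yield: initial_copies * (1 + efficiency) ** cycles.

    `efficiency` is the assumed fraction of templates copied per cycle
    (1.0 means perfect doubling). Plateau effects are ignored.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Perfect doubling from a single template over 30 cycles.
    print(expected_copies(1, 30))
    # A less efficient reaction yields substantially fewer copies.
    print(expected_copies(1, 30, efficiency=0.9))
```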

“Quantitative PCR” means a PCR designed to measure the abundance of one or more specific target sequences in a sample or specimen. Quantitative PCR includes both absolute quantitation and relative quantitation of such target sequences. Quantitative measurements are made using one or more reference sequences that may be assayed separately or together with a target sequence. The reference sequence may be endogenous or exogenous to a sample or specimen, and in the latter case, may comprise one or more competitor templates. Typical endogenous reference sequences include segments of transcripts of the following genes: β-actin, GAPDH, β₂-microglobulin, ribosomal RNA, and the like. Techniques for quantitative PCR are well-known to those of ordinary skill in the art, as exemplified in the following references that are incorporated by reference: Freeman et al, Biotechniques, 26: 112-126 (1999); Becker-Andre et al, Nucleic Acids Research, 17: 9437-9447 (1989); Zimmerman et al, Biotechniques, 21: 268-279 (1996); Diviacco et al, Gene, 122: 3013-3020 (1992); Becker-Andre et al, Nucleic Acids Research, 17: 9437-9446 (1989); and the like.

“Polynucleotide” or “oligonucleotide” is used interchangeably and each means a linear polymer of nucleotide monomers. Monomers making up polynucleotides and oligonucleotides are capable of specifically binding to a natural polynucleotide by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, wobble base pairing, or the like. As described in detail below, by “wobble base” is meant a nucleic acid base that can base pair with a first nucleotide base in a complementary nucleic acid strand but that, when employed as a template strand for nucleic acid synthesis, leads to the incorporation of a second, different nucleotide base into the synthesizing strand. Such monomers and their internucleosidic linkages may be naturally occurring or may be analogs thereof, e.g. naturally occurring or non-naturally occurring analogs. Non-naturally occurring analogs may include peptide nucleic acids (PNAs, e.g., as described in U.S. Pat. No. 5,539,082, incorporated herein by reference), locked nucleic acids (LNAs, e.g., as described in U.S. Pat. No. 6,670,461, incorporated herein by reference), phosphorothioate internucleosidic linkages, bases containing linking groups permitting the attachment of labels, such as fluorophores, or haptens, and the like. Whenever the use of an oligonucleotide or polynucleotide requires enzymatic processing, such as extension by a polymerase, ligation by a ligase, or the like, one of ordinary skill would understand that oligonucleotides or polynucleotides in those instances would not contain certain analogs of internucleosidic linkages, sugar moieties, or bases at any or some positions. Polynucleotides typically range in size from a few monomeric units, e.g. 5-40, when they are usually referred to as “oligonucleotides,” to several thousand monomeric units. Whenever a polynucleotide or oligonucleotide is represented by a sequence of letters (upper or lower case), such as “ATGCCTG,” it will be understood that the nucleotides are in 5′→3′ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes thymidine, “I” denotes deoxyinosine, “U” denotes uridine, unless otherwise indicated or obvious from context. Unless otherwise noted the terminology and atom numbering conventions will follow those disclosed in Strachan and Read, Human Molecular Genetics 2 (Wiley-Liss, New York, 1999). Usually polynucleotides comprise the four natural nucleosides (e.g. deoxyadenosine, deoxycytidine, deoxyguanosine, deoxythymidine for DNA or their ribose counterparts for RNA) linked by phosphodiester linkages; however, they may also comprise non-natural nucleotide analogs, e.g. including modified bases, sugars, or internucleosidic linkages. It is clear to those skilled in the art that where an enzyme has specific oligonucleotide or polynucleotide substrate requirements for activity, e.g. single stranded DNA, RNA/DNA duplex, or the like, then selection of appropriate composition for the oligonucleotide or polynucleotide substrates is well within the knowledge of one of ordinary skill, especially with guidance from treatises, such as Sambrook et al, Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York, 1989), and like references.

“Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of 18-40, 20-35, or 21-30 nucleotides long, and any length between the stated ranges. Typical primers can be in the range of 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.

Primers are usually single-stranded for maximum efficiency in amplification, but may alternatively be double-stranded. If double-stranded, the primer is usually first treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Thus, a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.

A “primer pair” as used herein refers to first and second primers having nucleic acid sequence suitable for nucleic acid-based amplification of a target nucleic acid. Such primer pairs generally include a first primer having a sequence that is the same or similar to that of a first portion of a target nucleic acid, and a second primer having a sequence that is complementary to a second portion of a target nucleic acid to provide for amplification of the target nucleic acid or a fragment thereof. Reference to “first” and “second” primers herein is arbitrary, unless specifically indicated otherwise. For example, the first primer can be designed as a “forward primer” (which initiates nucleic acid synthesis from a 5′ end of the target nucleic acid) or as a “reverse primer” (which initiates nucleic acid synthesis from a 5′ end of the extension product produced from synthesis initiated from the forward primer). Likewise, the second primer can be designed as a forward primer or a reverse primer.

“Primer site” (e.g., a sequencing primer site, an amplification primer site, etc.) as used herein refers to a domain in a polynucleotide that includes the sequence of a primer (e.g., a sequencing primer) and/or the complementary sequence of a primer. When present in single stranded form (e.g., in a single stranded polynucleotide), a primer site can be either the identical sequence of a primer or the complementary sequence of a primer. When present in double stranded form, a primer site contains the sequence of a primer hybridized to the complementary sequence of the primer. Thus, a primer site is a region of a polynucleotide that is either identical to or complementary to the sequence of a primer (when in a single stranded form) or a double stranded region formed between a primer sequence and its complement. Primer sites may be present in an adapter attached to a polynucleotide. The specific orientation of a primer site can be inferred by those of ordinary skill in the art from the structural features of the relevant polynucleotide and/or context in which it is used.

“Readout” means a parameter, or parameters, which are measured and/or detected that can be converted to a number or value. In some contexts, readout may refer to an actual numerical representation of such collected or recorded data. For example, a readout of fluorescent intensity signals from a microarray is the address and fluorescence intensity of a signal being generated at each hybridization site of the microarray; thus, such a readout may be registered or stored in various ways, for example, as an image of the microarray, as a table of numbers, or the like.

“Reflex site”, “reflex sequence” and equivalents are used to indicate sequences in a polynucleotide that are employed to move a domain intramolecularly from its initial location to a different location in the polynucleotide. The sequence of a reflex site can be added to a polynucleotide of interest (e.g., present in an adapter ligated to the polynucleotide), be based on a sequence naturally present within the polynucleotide of interest (e.g., a genomic sequence in the polynucleotide), or a combination of both. The reflex sequence is chosen so as to be distinct from other sequences in the polynucleotide (i.e., with little sequence homology to other sequences likely to be present in the polynucleotide, e.g., genomic or sub-genomic sequences to be processed). As such, a reflex sequence should be selected so as to not hybridize to any sequence except its complement under the conditions employed in the reflex processes herein described. As described later in this application, the complement to the reflex sequence is inserted on the same strand of the polynucleotide (e.g., the same strand of a double-stranded polynucleotide or on the same single stranded polynucleotide) in a particular location so as to facilitate an intramolecular binding event on such particular strand. Reflex sequences employed in the reflex process described herein can thus have a wide range of lengths and sequences. Reflex sequences may range from 5 to 200 nucleotide bases in length.

“Solid support”, “support”, and “solid phase support” are used interchangeably and refer to a material or group of materials having a rigid or semi-rigid surface or surfaces. In many embodiments, at least one surface of the solid support will be substantially flat, although in some embodiments it may be desirable to physically separate synthesis regions for different compounds with, for example, wells, raised regions, pins, etched trenches, or the like. According to other embodiments, the solid support(s) will take the form of beads, resins, gels, microspheres, or other geometric configurations. Microarrays usually comprise at least one planar solid phase support, such as a glass microscope slide.

“Specific” or “specificity” in reference to the binding of one molecule to another molecule, such as a labeled target sequence for a probe, means the recognition, contact, and formation of a stable complex between the two molecules, together with substantially less recognition, contact, or complex formation of that molecule with other molecules. In one aspect, “specific” in reference to the binding of a first molecule to a second molecule means that to the extent the first molecule recognizes and forms a complex with another molecule in a reaction or sample, it forms the largest number of the complexes with the second molecule. Preferably, this largest number is at least fifty percent. Generally, molecules involved in a specific binding event have areas on their surfaces or in cavities giving rise to specific recognition between the molecules binding to each other. Examples of specific binding include antibody-antigen interactions, enzyme-substrate interactions, formation of duplexes or triplexes among polynucleotides and/or oligonucleotides, biotin-avidin or biotin-streptavidin interactions, receptor-ligand interactions, and the like. As used herein, “contact” in reference to specificity or specific binding means two molecules are close enough that weak noncovalent chemical interactions, such as van der Waals forces, hydrogen bonding, base-stacking interactions, ionic and hydrophobic interactions, and the like, dominate the interaction of the molecules.

As used herein, the term “T_(m)” is used in reference to the “melting temperature.” The melting temperature is the temperature (e.g., as measured in ° C.) at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. Several equations for calculating the T_(m) of nucleic acids are known in the art (see e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization (1985)). Other references (e.g., Allawi, H. T. & SantaLucia, J., Jr., Biochemistry 36, 10581-94 (1997)) include alternative methods of computation which take structural and environmental, as well as sequence, characteristics into account for the calculation of T_(m).
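By way of illustration only, one of the simplest such calculations can be sketched as follows. The sketch uses the commonly cited Wallace rule (2° C. for each A or T and 4° C. for each G or C), a rule of thumb for short oligonucleotides that is not drawn from the references cited above and that ignores salt concentration, strand concentration, and nearest-neighbor effects. The function name `wallace_tm` is hypothetical.

```python
def wallace_tm(seq: str) -> int:
    """Rough T_m estimate in degrees C using the Wallace rule.

    Assumption: the rule T_m = 2*(A+T) + 4*(G+C) is only a rule of thumb
    for short DNA oligonucleotides and is used here purely for illustration.
    """
    s = seq.upper()
    if not s or set(s) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    at = s.count("A") + s.count("T")  # weakly pairing bases
    gc = s.count("G") + s.count("C")  # strongly pairing bases
    return 2 * at + 4 * gc


# Example using the 7-mer written 5'->3' in the definitions above.
print(wallace_tm("ATGCCTG"))
```

For sequences of practical primer length, the nearest-neighbor approaches referenced above would generally be expected to give more reliable estimates than this rule.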

“Sample” means a quantity of material from a biological, environmental, medical, or patient source in which detection, measurement, or labeling of target nucleic acids is sought. On the one hand it is meant to include a specimen or culture (e.g., microbiological cultures). On the other hand, it is meant to include both biological and environmental samples. A sample may include a specimen of synthetic origin. Biological samples may be animal, including human, fluid, solid (e.g., stool) or tissue, as well as liquid and solid food and feed products and ingredients such as dairy items, vegetables, meat and meat by-products, and waste. Biological samples may include materials taken from a patient including, but not limited to, cultures, blood, saliva, cerebral spinal fluid, pleural fluid, milk, lymph, sputum, semen, needle aspirates, and the like. Biological samples may be obtained from all of the various families of domestic animals, as well as feral or wild animals, including, but not limited to, such animals as ungulates, bear, fish, rodents, etc. Environmental samples include environmental material such as surface matter, soil, water and industrial samples, as well as samples obtained from food and dairy processing instruments, apparatus, equipment, utensils, disposable and non-disposable items. These examples are not to be construed as limiting the sample types applicable to the present invention.

The terms “upstream” and “downstream” in describing nucleic acid molecule orientation and/or polymerization are used herein as understood by one of skill in the art. As such, “downstream” generally means proceeding in the 5′ to 3′ direction, i.e., the direction in which a nucleotide polymerase normally extends a sequence, and “upstream” generally means the converse. For example, a first primer that hybridizes “upstream” of a second primer on the same target nucleic acid molecule is located on the 5′ side of the second primer (and thus nucleic acid polymerization from the first primer proceeds towards the second primer).

It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely”, “only” and the like in connection with the recitation of claim elements, or the use of a “negative” limitation.

DETAILED DESCRIPTION OF THE INVENTION

The invention is drawn to compositions and methods for intramolecular nucleic acid rearrangement that find use in various applications of genetic analysis, including sequencing, as well as general molecular biological manipulations of polynucleotide structures.

Before the present invention is described, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, some potential and preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. It is understood that the present disclosure supersedes any disclosure of an incorporated publication to the extent there is a contradiction.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a nucleic acid” includes a plurality of such nucleic acids and reference to “the compound” includes reference to one or more compounds and equivalents thereof known to those skilled in the art, and so forth.

The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, A., Principles of Biochemistry 3^(rd) Ed., W. H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry, 5^(th) Ed., W. H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.

The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.

As summarized above, aspects of the present invention are drawn to the use of a ‘reflex’ sequence present in a polynucleotide (e.g., in an adapter structure of the polynucleotide, in a genomic region of the polynucleotide, or a combination of both) to move a domain of the polynucleotide intra-molecularly from a first location to a second location. The reflex process described herein finds use in any number of applications, e.g., placing functional elements of a polynucleotide (e.g., sequencing primer sites and/or MID tags) into proximity to a desired sub-region of interest.

Nucleic Acids

The reflex process (as described in detail below) can be employed for the manipulation and analysis of nucleic acid sequences of interest from virtually any nucleic acid source, including but not limited to genomic DNA, complementary DNA (cDNA), RNA (e.g., messenger RNA, ribosomal RNA, short interfering RNA, microRNA, etc.), plasmid DNA, mitochondrial DNA, synthetic DNA, etc. Furthermore, any organism, organic material or nucleic acid-containing substance can be used as a source of nucleic acids to be processed in accordance with the present invention including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ancient samples, etc. In certain embodiments, the nucleic acids in the nucleic acid sample are derived from a mammal, where in certain embodiments the mammal is a human.

In certain embodiments, the nucleic acid sequences are enriched prior to the reflex sequence process. By enriched is meant that the nucleic acid is subjected to a process that reduces the complexity of the nucleic acids, generally by increasing the relative concentration of particular nucleic acid species in the sample (e.g., having a specific locus of interest, including a specific nucleic acid sequence, lacking a locus or sequence, being within a specific size range, etc.). There are a wide variety of ways to enrich nucleic acids having a specific characteristic(s) or sequence, and as such any convenient method to accomplish this may be employed. The enrichment (or complexity reduction) can take place at any of a number of steps in the process, and will be determined by the desires of the user. For example, enrichment can take place in individual parental samples (e.g., untagged nucleic acids prior to adaptor ligation) or in multiplexed samples (e.g., nucleic acids tagged with primer sites, MID and/or reflex sequences and pooled; MID are described in further detail below).

In certain embodiments, nucleic acids in the nucleic acid sample are amplified prior to analysis. In certain of these embodiments, the amplification reaction also serves to enrich a starting nucleic acid sample for a sequence or locus of interest. For example, a starting nucleic acid sample can be subjected to a polymerase chain reaction (PCR) that amplifies one or more regions of interest. In certain embodiments, the amplification reaction is an exponential amplification reaction, whereas in certain other embodiments, the amplification reaction is a linear amplification reaction. Any convenient method for performing amplification reactions on a starting nucleic acid sample can be used in practicing the subject invention. In certain embodiments, the nucleic acid polymerase employed in the amplification reaction is a polymerase that has proofreading capability (e.g., phi29 DNA Polymerase, Thermococcus litoralis DNA polymerase, Pyrococcus furiosus DNA polymerase, etc.).

In certain embodiments, the nucleic acid sample being analyzed is derived from a single source (e.g., a single organism, virus, tissue, cell, subject, etc.), whereas in other embodiments, the nucleic acid sample is a pool of nucleic acids extracted from a plurality of sources (e.g., a pool of nucleic acids from a plurality of organisms, tissues, cells, subjects, etc.), where by “plurality” is meant two or more. As such, in certain embodiments, a nucleic acid sample can contain nucleic acids from 2 or more sources, 3 or more sources, 5 or more sources, 10 or more sources, 50 or more sources, 100 or more sources, 500 or more sources, 1000 or more sources, 5000 or more sources, up to and including about 10,000 or more sources.

In certain embodiments, nucleic acid fragments are to be pooled with nucleic acid fragments derived from a plurality of sources (e.g., a plurality of organisms, tissues, cells, subjects, etc.), where by “plurality” is meant two or more. In such embodiments, the nucleic acids derived from each source include a multiplex identifier (MID) such that the source from which each tagged nucleic acid fragment was derived can be determined. In such embodiments, each nucleic acid sample source is correlated with a unique MID, where by unique MID is meant that each different MID employed can be differentiated from every other MID employed by virtue of at least one characteristic, e.g., the nucleic acid sequence of the MID. Any type of MID can be used, including but not limited to those described in co-pending U.S. patent application Ser. No. 11/656,746, filed on Jan. 22, 2007, and titled “Nucleic Acid Analysis Using Sequence Tokens”, as well as U.S. Pat. No. 7,393,665, issued on Jul. 1, 2008, and titled “Methods and Compositions for Tagging and Identifying Polynucleotides”, both of which are incorporated herein by reference in their entirety for their description of nucleic acid tags and their use in identifying polynucleotides. In certain embodiments, a set of MIDs employed to tag a plurality of samples need not have any particular common property (e.g., T_(m), length, base composition, etc.), as the asymmetric tagging methods (and many tag readout methods, including but not limited to sequencing of the tag or measuring the length of the tag) can accommodate a wide variety of unique MID sets.
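By way of illustration only, the correlation of each tagged fragment with its source by MID sequence can be sketched as follows. The sketch makes simplifying assumptions that are not required by the methods described herein: each read is assumed to begin at its 5′ end with an MID of a single fixed length, MIDs are matched exactly, and the names `demultiplex`, `mid_to_source`, and `mid_len` are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, mid_to_source, mid_len):
    """Group reads by source using the MID at the 5' end of each read.

    Assumptions (illustrative only): fixed-length MIDs, exact matching,
    and reads written 5'->3' with the MID first.
    """
    by_source = defaultdict(list)
    for read in reads:
        mid, remainder = read[:mid_len], read[mid_len:]
        # Reads whose MID is not in the table are kept but flagged.
        source = mid_to_source.get(mid, "unassigned")
        by_source[source].append(remainder)
    return dict(by_source)


# Hypothetical MID table and pooled reads.
mids = {"ACGT": "sample_1", "TGCA": "sample_2"}
pooled = ["ACGTTTGA", "TGCAGGCC", "ACGTCCCC", "GGGGAAAA"]
print(demultiplex(pooled, mids, mid_len=4))
```

Because the MID sets described above need not share a common length, a practical implementation would likely match against the full set of MIDs rather than slicing a fixed-length prefix.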

In certain embodiments, each individual polynucleotide (e.g., double-stranded or single-stranded, as appropriate to the methodological details employed) in a sample to be analyzed is tagged with a unique MID so that the fate of each polynucleotide can be tracked in subsequent processes (where, as noted above, unique MID is meant to indicate that each different MID employed can be differentiated from every other MID employed by virtue of at least one characteristic, e.g., the nucleic acid sequence of the MID). For example (and as described below), having each nucleic acid tagged with a unique MID allows analysis of the sequence of each individual nucleic acid using the reflex sequence methods described herein. This allows the linkage of sequence information for large nucleic acid fragments that cannot be sequenced in a single sequencing run.

Reflex Sequence Process

As summarized above, aspects of the present invention include methods and compositions for moving a domain in a polynucleotide from a first location to a second location in the polynucleotide. An exemplary embodiment is shown in FIG. 1A.

FIG. 1A shows a single stranded polynucleotide 100 comprising, in a 5′ to 3′ orientation, a first domain (102; the domain to be moved); a reflex sequence 104; a nucleic acid sequence 106 having a site distal to the first domain (Site A), and a complement of the reflex sequence 108 (positioned at the 3′ terminus of the polynucleotide). The steps of the reflex method described below will move the first domain into closer proximity to Site A. It is noted here that the prime designation in FIG. 1A denotes a complementary sequence of a domain. For example, First Domain′ is the complement of the First Domain.

In Step 1, the reflex sequence and its complement in the polynucleotide are annealed intramolecularly to form polynucleotide structure 112, with the polynucleotide folding back on itself and hybridizing to form a region of complementarity (i.e., double stranded reflex/reflex′ region). In this configuration, the 3′ end of the complement of the reflex sequence can serve as a nucleic acid synthesis priming site. Nucleic acid synthesis from this site is then performed in extension Step 2 producing a complement of the first domain at the 3′ end of the nucleic acid extension (shown in polynucleotide 114; extension is indicated with dotted arrow labeled “extend”).

Denaturation of polynucleotide 114 (e.g., by heat) generates linear single stranded polynucleotide 116. As shown in FIG. 1, resultant polynucleotide 116 contains a complement of the first domain at a position proximal to Site A (i.e., separated by only the complement of the reflex sequence). This resultant polynucleotide may be used for any subsequent analysis or processing steps as desired by the user (e.g., sequencing, as a template for amplification (linear, PCR, etc.), sequence specific extraction, etc.).
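By way of illustration only, the intramolecular annealing, extension, and denaturation steps described above can be sketched at the level of sequence strings written 5′ to 3′. The sketch assumes unmodified DNA bases, perfect hybridization, a reflex sequence occurring only once upstream of its 3′ complement, and complete extension to the 5′ end of the template. The function names and example sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def reflex_extend(poly: str, reflex: str) -> str:
    """Model Steps 1-2 and denaturation for a structure-100-like strand.

    poly is expected to read: first domain + reflex + sequence + reflex'.
    Returns a strand analogous to polynucleotide 116, in which the
    complement of the first domain has been appended at the 3' end.
    """
    reflex_rc = revcomp(reflex)
    if not poly.endswith(reflex_rc):
        raise ValueError("3' terminus is not the complement of the reflex sequence")
    site = poly.find(reflex)
    if site < 0 or site + len(reflex) > len(poly) - len(reflex_rc):
        raise ValueError("reflex sequence not found upstream of its 3' complement")
    # Step 1: reflex' (3' end) anneals to reflex; Step 2: the 3' end is
    # extended across everything 5' of the reflex sequence (the first domain).
    first_domain = poly[:site]
    return poly + revcomp(first_domain)


first_domain, reflex, region = "GATTACA", "CCGTAGGC", "ATATATGGG"
poly_100 = first_domain + reflex + region + revcomp(reflex)
poly_116 = reflex_extend(poly_100, reflex)
print(poly_116)
```

In the resulting string, the complement of the first domain is separated from the 3′ end of the region (corresponding to Site A) only by the complement of the reflex sequence, consistent with the description of polynucleotide 116.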

In alternative embodiments, the first domain and reflex sequence are removed from the 5′ end of the double-stranded region of polynucleotide 114 (shown in polynucleotide 118; removal is shown in the dotted arrow labeled “remove”). Removal of this region may be accomplished by any convenient method, including, but not limited to, treatment (under appropriate incubation conditions) of polynucleotide structure 114 with T7 exonuclease or by treatment with Lambda exonuclease; the Lambda exonuclease can be employed so long as the 5′ end of the polynucleotide is phosphorylated. If the region is removed enzymatically, resultant polynucleotide 118 is used in place of polynucleotide 116 in subsequent steps (e.g., copying to reverse polarity).

In certain embodiments, polynucleotide 116 or 118 is used as a template to produce a double stranded polynucleotide, for example by performing a nucleic acid synthesis reaction with a primer that primes in the complement of the first domain. This step is sometimes referred to as copying to reverse polarity of a single stranded polynucleotide, and in some instances, the double-stranded intermediate product of this copying is not shown (see, e.g., FIG. 3). For example, copying to reverse the polarity of polynucleotide 116 results in single-stranded polynucleotide 120 having, in a 5′ to 3′ orientation, the first domain (122); the reflex sequence (124); the complement of polynucleotide 106 (oriented with the complement of Site A (Site A′; 126) proximal to the reflex sequence); the complement of the reflex sequence (128); and the complement of the first domain (130).

In certain embodiments, the first domain in the polynucleotide comprises one or more elements that find use in one or more subsequent processing or analysis steps. Such sequences include, but are not limited to, restriction enzyme sites, PCR primer sites, linear amplification primer sites, reverse transcription primer sites, RNA polymerase promoter sites (such as for T7, T3 or SP6 RNA polymerase), MID tags, sequencing primer sites, etc. Any convenient element can be included in the first domain and, in certain embodiments, is determined by the desires of the user of the methods described herein.

As an exemplary embodiment, suppose we want to sequence a specific polynucleotide region from multiple genomes in a pooled sample where the polynucleotide region is too long to sequence in a single reaction. An example is sequencing a polynucleotide region that is 2 kilobases or more in length using Roche 454 (Branford, Conn.) technology, in which the length of a single sequencing run is about 400 bases. In this scenario, we can design a set of left hand primers (A_(n)) and right hand primers (B_(n)) specific for the polynucleotide region that are positioned in such a way that we can obtain direct sequences of all parts of the insert, as shown in FIG. 1B. Note that the polynucleotide shown in FIG. 1B (140) has a domain (142) containing a primer site and an MID denoting from which original sample(s) the polynucleotide is derived. Site 142 thus represents an example of a First Domain site such as that identified as 122 in FIG. 1A. The polynucleotide also includes a reflex site (144), which can be part of the polynucleotide region itself (e.g., a genomic sequence), added in a ligated adapter domain along with the primer site and the MID (an artificial sequence), or a combination of both (a sequence spanning the adapter/polynucleotide junction).

It is noted here that polynucleotide 140 can be categorized as a precursor to polynucleotide 100 in FIG. 1A, as it does not include a 3′ reflex sequence complementary to the reflex site (domain 108 in FIG. 1A). As detailed below, polynucleotide 140 can be converted to a polynucleotide having the structural configuration of polynucleotide 100, a polynucleotide suitable as a substrate for the reflex process described herein (e.g., by primer extension using a B_(n) primer and reversal of polarity).

In an exemplary embodiment, each A_(n)-B_(n) primer pair defines a nucleic acid region that is approximately 400 bases in length or less. This size range is within the single-sequencing run read length of the current Roche 454 sequencing platform; a different size range for the defined nucleic acid region may be utilized for a different sequencing platform. Thus, each product from each reflex process can be sequenced in a single run. It is noted here that primer pairs as shown in FIG. 1B can be used to define regions 1 to 5 shown in FIG. 3 (described in further detail below).
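By way of illustration only, the division of a long polynucleotide region into sub-regions that each fit within a single sequencing read can be sketched as follows. The sketch assumes abutting, non-overlapping windows and ignores primer length, primer placement constraints, and any overlap a user might want between adjacent regions. The function name `tile_region` and the coordinate convention (0-based, half-open) are hypothetical.

```python
def tile_region(region_len: int, max_read: int = 400):
    """Split a region into consecutive windows no longer than max_read.

    Returns a list of (start, end) tuples using 0-based, half-open
    coordinates. Illustrative only: real primer pair design would also
    account for primer binding sites and sequence context.
    """
    if region_len <= 0 or max_read <= 0:
        raise ValueError("region_len and max_read must be positive")
    return [
        (start, min(start + max_read, region_len))
        for start in range(0, region_len, max_read)
    ]


# A 2 kilobase region with a read length of about 400 bases.
for n, window in enumerate(tile_region(2000, 400), start=1):
    print(n, window)
```

Under these assumptions a 2 kilobase region yields five windows, each of which would be defined by its own A_(n)-B_(n) primer pair.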

In certain embodiments, to obtain the first part of the sequence of the polynucleotide region (i.e., in the original structure, that part of the polynucleotide closest to the first domain), we only need a right hand primer (e.g., B₀) and we do not need to transfer the MID as it is within reach of this sequencing primer (i.e., the MID is within 400 bases of sequencing primer B₀). All other B_(n) primers have the reflex sequence added to their 5′ ends (“R” element shown on B primers) so that they read 5′ reflex-B_(n). However, in certain embodiments, the B₀ primer does include the reflex sequence and is used in the reflex process (along with a corresponding A₀ primer) as detailed below.

As described above, we obtain a single stranded polynucleotide having, in the 5′ to 3′ orientation, a primer site (e.g., for Roche 454 sequencing), an MID, a reflex sequence and the polynucleotide to be sequenced. Numerous methods for obtaining single-stranded polynucleotides of interest have been described and are known in the art, including in U.S. Pat. No. 7,217,522, issued on May 15, 2007; U.S. patent application Ser. No. 11/377,462, filed on Mar. 16, 2006; and U.S. patent application Ser. No. 12/432,080, filed on Apr. 29, 2009; each of which is incorporated by reference herein in its entirety. For example, a single stranded product can be produced using linear amplification with a primer specific for the primer site of the template. In certain embodiments, the primer includes a binding moiety to facilitate isolation of the single stranded nucleic acid of interest, e.g., to immobilize the top strand on a binding partner of the binding moiety immobilized on a solid support. Removal of a hybridized, non-biotinylated strand by denaturation using heat or high pH (or any other convenient method) serves to isolate the biotinylated strand. Binding moieties and their corresponding binding partners are sometimes referred to herein as binding partner pairs. Any convenient binding partner pairs may be used, including but not limited to biotin/avidin (or streptavidin), antigen/antibody pairs, etc.

It is noted here that while the figures and description of the reflex process provided herein depict manipulations with regard to a single stranded polynucleotide, it is not necessarily required that the single stranded polynucleotide described or depicted in the figures be present in the sample in an isolated form (i.e., isolated from its complementary strand). In other words, double stranded polynucleotides may be used where only one strand is described/depicted, which will generally be determined by the user.

The implementation of a single strand isolation step using the methods described above or variations thereof (or any other convenient single strand isolation step) will generally be based on the desires of the user. One example of isolating single stranded polynucleotides is shown in FIG. 2. In this Figure, a starting double stranded template (with 5′ to 3′ orientation shown as an arrow) is denatured and primed with a biotinylated synthesis primer specific for the primer site. After extension of the primer (i.e., nucleic acid synthesis), the sample is contacted with a solid support having streptavidin bound to it. The biotin moiety (i.e., the binding partner of streptavidin) on the extended strands will bind to the solid-phase streptavidin. Denaturation and washing are then performed to remove all non-biotinylated polynucleotide strands. If desired, the bound polynucleotide, which can be used in subsequent reflex process steps (e.g., as a template for B_(n) primer extension reactions), may be eluted from the streptavidin support. Alternatively, the bound polynucleotide may be employed in subsequent steps of the desired process while still bound to the solid support (e.g., in solid phase extension reactions using B_(n) primers). This process, with minor variations depending on the template being used and the identity of the desired single stranded polynucleotide, may be employed at any of a number of steps in which a single stranded product is to be isolated. It is noted that in certain embodiments, substrate bound biotinylated polynucleotide can be used to produce and isolate non-biotinylated single stranded products (i.e., by eluting the non-biotinylated products while leaving the biotinylated templates bound to the streptavidin on the solid support). Thus, the specifics of how binding partners are used to isolate single stranded polynucleotides of interest will vary depending on experimental design parameters.

Additional single-stranded isolation/production methods include asymmetric PCR, strand-specific enzymatic degradation, and the use of in-vitro transcription followed by reverse transcriptase (IVT-RT) with subsequent destruction of the RNA strand. As noted above, any convenient single stranded production/isolation method may be employed.

To the single stranded polynucleotide shown in FIG. 1B we anneal one of the B_(n) primers having the appended reflex sequence, denoted with a capital “R” (e.g., B₁) and extend the primer under nucleic acid synthesis conditions to produce a copy of the polynucleotide that has a reflex sequence at its 5′ end. A single stranded copy of this polynucleotide is then produced to reverse polarity using a primer specific for the primer site in the first domain′ (complement of the first domain 102). The resulting nucleic acid has structure 100 shown in FIG. 1A, where the first domain 102 includes the primer site and the MID. Site A (110) in FIG. 1 is determined by the specificity of the 5′ reflex-B_(n) primer used.

The reflex process (e.g., as shown in FIG. 1) is then performed to produce a product in which the primer site and the MID are now in close proximity to the desired site (or region of interest (ROI)) within the original polynucleotide (i.e., the site defined by the primer used, e.g., B₁). The resulting polynucleotide can be used in subsequent analyses as desired by the user (e.g., Roche 454 sequencing technology).
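By way of illustration only, the fold-back and copy-back steps described above can be sketched as string operations on a single strand. The sequences, the names `reflex_extend`, `t7_digest` and `copy_back`, and the simplification that the optional exonuclease step removes exactly the 5′ region through the reflex sequence are all assumptions made for this sketch; they are not taken from the figures.

```python
# Toy model of the reflex (fold-back) step on one strand, written 5' -> 3'.
# Every sequence below is a made-up placeholder, not a sequence from the source.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def reflex_extend(strand: str, reflex: str) -> str:
    """Anneal the 3' end to the internal reflex sequence and extend it.

    The 3' end copies whatever lies 5' of the reflex sequence
    (here: primer site + MID), appending its complement.
    """
    if not strand.endswith(revcomp(reflex)):
        raise ValueError("3' end is not the complement of the reflex sequence")
    pos = strand.find(reflex)
    if pos < 0:
        raise ValueError("reflex sequence not found in strand")
    return strand + revcomp(strand[:pos])


def t7_digest(strand: str, reflex: str) -> str:
    """Simplified optional step: drop the 5' region through the reflex sequence."""
    pos = strand.find(reflex)
    return strand[pos + len(reflex):]


def copy_back(template: str, primer: str) -> str:
    """Copy the template from a primer annealed to its complement (reverses polarity)."""
    site = revcomp(primer)
    idx = template.rfind(site)
    if idx < 0:
        raise ValueError("primer has no annealing site on the template")
    return revcomp(template[: idx + len(site)])


primer_site, mid, reflex = "GATTACAGG", "TCAGTC", "CCGTAAGC"
insert = "AAAATTTTGGGGCACACA"  # the desired site is at the 3' end of the insert
start = primer_site + mid + reflex + insert + revcomp(reflex)

extended = reflex_extend(start, reflex)
product = copy_back(t7_digest(extended, reflex), primer_site)
print(product)
```

In this sketch the final product begins with the primer site, the MID and the reflex sequence, followed directly by the complement of the 3′ end of the original insert, which corresponds to the relocation of the adapter domain described above.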

It is noted here that, while not shown in FIGS. 1A and 1B, any convenient method for adding adapters to a polynucleotide to be processed as described herein may be used in the practice of the reflex process (adapters containing, e.g., primer sites, polymerase sites, MIDs, restriction enzyme sites, and reflex sequences). For example, adapters can be added at a particular position by ligation. For double stranded polynucleotides, an adapter can be configured to be ligated to a particular restriction enzyme cut site. Where a single stranded polynucleotide is employed, a double stranded adapter construct that possesses an overhang configured to bind to the end of the single-stranded polynucleotide can be used. For example, in the latter case, the end of a single stranded polynucleotide can be modified to include specific nucleotide bases that are complementary to the overhang in the double stranded adaptor using terminal transferase and specific nucleotides. In other embodiments, PCR or linear amplification methods using adapter-conjugated primers are employed to add an adapter at a site of interest. Again, any convenient method for producing a starting polynucleotide may be employed in practicing the methods of the subject invention.

In certain embodiments, the nucleic acid may be sequenced directly using a sequencing primer specific for the primer site. This sequencing reaction will read through the MID and desired site in the insert.

In certain embodiments, the polynucleotide may be isolated (or fractionated) using an appropriate A_(n) primer (e.g., when using B₁ as the first primer, primer A₁ can be used). In certain embodiments, the A_(n) primed polynucleotide is subjected to nucleic acid synthesis conditions to produce a copy of the fragment produced in the reflex process. In certain of these embodiments, the A_(n) primer has appended on its 5′ end a primer site that can be used in subsequent steps, including sequencing reactions. Providing a primer site in the A_(n) primer allows amplifying and/or sequencing from both ends of the resultant fragment: from the primer site in the first domain 102 and the primer site in the A_(n) primer (not shown in FIG. 1B). Because of the position of the primer sites and their distance apart (i.e., less than one sequencing run apart), sequencing from both ends will usually capture the sequence of the desired site (or ROI) and the sequence of the MID, which can be used for subsequent bioinformatic analyses, e.g., to positively identify the sample of origin. It is noted here that while sequencing in both directions is possible, it is not necessary, as sequencing from either primer site alone will capture the sequence of the ROI as well as its corresponding MID sequence.

Note that in certain embodiments in which the first fragment is obtained by amplification/extension from primer B₀ directly, the polarity of the ROI in the resulting fragment is reversed as compared to the ROI in fragments obtained by primers B₁-B_(n). This is because the B₀-generated fragment, unlike the B₁-B_(n) generated fragments, has not been subjected to a reflex process which reverses the orientation of the ROI sequence with respect to the first domain/reflex sequence (as described above). Therefore, the B₀ primer may have appended to it a primer site (e.g., at its 5′ end) that can be used for subsequent amplification and/or sequencing reactions (e.g., in Roche 454 sequencing system) rather than a reflex sequence as with primers B₁-B_(n). However, in certain embodiments, as noted above, the reflex process may be used with a corresponding B₀-A₀ primer pair as described above, i.e., using a B₀ primer having a 5′ reflex sequence and a corresponding A₀ primer with its corresponding 5′ adapter domain (e.g., a primer site).

It is noted here that because the particular sections of sequence to be analyzed are defined by the A_(n)-B_(n) primer pairs (as shown and described above), a much higher sequence specificity is achieved as compared to using previous extraction methods that employ only a single oligo binding event (e.g., using probes on a microarray).

FIG. 3 provides a detailed flow chart for an exemplary embodiment that employs reflex sequences for use in sequencing multiple specific regions in a polynucleotide (i.e., regions 1, 2, 3, 4 and 5 in an 11 kb region of lambda DNA).

A single parent DNA fragment 202 is generated that includes adapter domains (i.e., a Roche 454 sequencing primer site, a single MID, and a reflex sequence) and the sequence of interest. In the example shown, the sequence of interest is from lambda DNA and the reflex sequence is present on the top strand (with its complement shown in the bottom strand). Any convenient method for producing this parent DNA fragment may be used, including amplification with a primer that includes the adapter domains (e.g., using PCR), cloning the fragment into a vector that includes the adapter domains (e.g., a vector with the adapter domains adjacent to a cloning site), or by attaching adapters to polynucleotide fragments (e.g., fragments made by random fragmentation, by sequence-specific restriction enzyme digestion, or combinations thereof). While only a single fragment with a single MID is shown, the steps in FIG. 3 are applicable to samples having multiple different fragments each with a different MID, e.g., a sample having a population of homologous fragments from any number of different sources (e.g., different individuals). FIG. 3 describes the subsequent enzymatic steps involved in creating the five daughter fragments in which regions 1, 2, 3, 4 and 5 (shown in polynucleotide 204) are rearranged to be placed within a functional distance of the adapter domains (i.e., close enough to the adapter domains to be sequenced in a single Roche 454 sequencing reaction). Note that certain steps are shown for region 4 only (206).

In step 1, the five regions of interest are defined within the parent fragment (labeled 1 to 5 in polynucleotide 204) and corresponding primer pairs are designed for each. The distance of each region of interest from the reflex sequence is shown below polynucleotide 204. The primer pairs are designed as described and shown in FIG. 1B (i.e., the A_(n)-B_(n) primer pairs). For clarity, only primer sites for region 4 are shown in FIG. 3 (“primer sites” surrounding region 4). In step 2, sequence specific primer extensions are performed (only region 4 is shown) with corresponding B_(n) primers to produce single stranded polynucleotides having structure 208 (i.e., having the reflex sequence on the 5′ terminus). As shown, the B_(n) primer for region 4 will include a sequence specific primer site that primes at the 3′-most primer site noted for region 4 (where “3′-most” refers to the template strand, which in FIG. 3 is the top strand). This polynucleotide is copied back to produce polynucleotide 210 having reversed polarity (e.g., copied using a primer that hybridizes to the 454A′ domain). Polynucleotide 210 has structure similar to polynucleotide 100 shown at the top of FIG. 1. Step 4 depicts the result of the intramolecular priming between the reflex sequence and its complement followed by extension to produce the MID′ and 454A′ structures at the 3′ end (polynucleotide 212). In the embodiments shown in FIG. 3, polynucleotide 212 is treated with T7 exonuclease to remove double stranded DNA from the 5′ end (as indicated above, this step is optional). The polynucleotide formed for region 4 is shown as 216 with polynucleotides for the other regions also shown (214).

It is noted here that the formation of each of the polynucleotides 214 may be accomplished either in separate reactions (i.e., the structure with region 1 in proximity to the adapter domains is in a first sample, the structure with region 2 in proximity to the adapter domains is in a second sample, etc.) or in one or more combined samples.

In step 6, the polynucleotides 214 are copied to reverse polarity to form polynucleotides 218. In step 7, each of these products is then primed with the second primer of the specific primer pair (see A_(n) primers as shown in FIG. 1B), each having a second Roche 454 primer site (454B) attached at the 5′ end, and extended to form products 220. Steps 6 and 7 may be combined (e.g., in a single PCR or other amplification reaction).

In summary, FIG. 3 shows how the reflex process can be employed to produce five daughter fragments 220 of similar length (e.g., ˜500 bp), each of which contains DNA sequences that differ in their distance from the reflex sequence in the starting structure 202, while maintaining the original MID.

FIG. 4 shows another exemplary use of the reflex process as described herein. In the embodiment shown in FIG. 4, a target sequence (i.e., containing region of interest “E”) is enriched from a pool of adapter-attached fragments. In certain embodiments, the fragments are randomly sheared, selected for a certain size range (e.g., DNA having a length from 100 to 5000 base pairs), and tagged with adapters (e.g., asymmetric adapters, e.g., as described in U.S. patent application Ser. No. 12/432,080, filed on Apr. 29, 2009). The asymmetric adaptor employed in FIG. 4 contains a sequencing primer site (454A, as used in the Roche 454 sequencing platform), an MID, an X sequence, and an internal stem region (ISR), which denotes the region of complementarity for the asymmetric adapter that is adjacent to the adapter attachment site (see, e.g., the description in U.S. application Ser. No. 12/432,080, filed on Apr. 29, 2009, incorporated herein by reference in its entirety). The X sequence can be any sequence that can serve as a binding site for a polynucleotide containing the complement of the X sequence (similar to a primer site). As described below, the X sequence allows for the annealing of an oligonucleotide having a 5′ overhang that can serve as a template for extension of the 3′ end of the adaptor oligonucleotide. The sequencing direction of the sequencing primer site (454A primer site in structure 401 of FIG. 4) is oriented such that amplification of the adapter ligated fragment using the sequencing primer site proceeds away from the ligated genomic insert. This has the effect of making the initial asymmetric adapter ligated library ‘inert’ to amplification using this primer, e.g., in a PCR reaction.

To extract a region of interest (the “E” region), the library is mixed with an oligonucleotide (403) containing a 3′ X′ sequence and a target specific priming sequence (the 1′ sequence) under hybridization/annealing conditions. The target specific sequence 1′ is designed to flank one side of the region of interest (the 1′ sequence adjacent to E in the genomic insert; note that only the E-containing polynucleotide fragment is shown in FIG. 4), much like a PCR primer. After annealing primer 403, the hybridized complex is extended, whereby all of the adaptor tagged fragments will obtain the complement of the target specific sequence (i.e., the 1 sequence) on the 3′ end (see structure 405; arrows denote the direction of extension).

Extended products 405 are then denatured and the 1/1′ regions allowed to hybridize intramolecularly in a reflex process priming event, after which nucleic acid extension is performed to form structure 407 (extension is from the 1 priming site; shown with an arrow). This reflex reaction creates a product (407) that, unlike its parent structure (405), has a sequencing primer site (454A) that is oriented such that extension using this primer sequence proceeds towards the region of interest. Thus, in the absence of a priming and extension reflex reaction, extension with a sequencing primer will not generate a product containing the region of interest (the E region). In other words, only E-region containing target polynucleotides will have a 454A sequence that can amplify genomic material (structure 407).

After completing the reflex process (using 1/1′ as the reflex sequences), a PCR amplification reaction is performed to amplify the region of interest (with associated adapter domains). However, before performing the PCR reaction, the fragment sample is “inactivated” from further extension using terminal transferase and ddNTPs. This inactivation prevents non-target adaptor tagged molecules from performing primer extension from the 3′ primer 1 site. Once inactivated, a PCR reaction is performed using a sequencing primer (i.e., 454A primer 409) and a second primer that primes and extends from the opposite side of the region of interest (i.e., primer 411, which includes a 5′ 454B sequencing primer site and a 3′ “2” region that primes on the opposite end of E from the 1 region). Only fragments that have undergone the reflex process and contain the E region will be suitable templates for the PCR reaction and produce the desired product (413).

Thus, the process exemplified in FIG. 4 allows for the movement of an adapter domain (e.g., containing functional elements and/or MID) into proximity to a desired region of interest.

The reflex process described herein can be used to perform powerful linkage analysis by combining it with nucleic acid counting methods. Any convenient method for tagging and/or counting individual nucleic acid molecules with unique tags may be employed (see, e.g., U.S. Pat. No. 7,537,897, issued on May 26, 2009; U.S. Pat. No. 7,217,522, issued on May 15, 2007; U.S. patent application Ser. No. 11/377,462, filed on Mar. 16, 2006; and U.S. patent application Ser. No. 12/432,080, filed on Apr. 29, 2009; each of which is incorporated by reference herein in its entirety). All of this can be conducted in parallel, thus saving on the cost of labor, time and materials.

In one exemplary embodiment, a large collection of sequences is tagged with MIDs such that each polynucleotide molecule in the sample has a unique MID. In other words, each polynucleotide in the sample (e.g., each individual double stranded or single stranded polynucleotide) is tagged with a MID that is different from every other MID on every other polynucleotide in the sample. In general, to accomplish such molecular tagging the number of distinct MID tags to be used should be many times greater than the actual number of molecules to be analyzed. This will result in the majority of individual nucleic acid molecules being labeled with a unique ID tag (see, e.g., Brenner et al., Proc. Natl. Acad. Sci. USA. 2000 97(4):1665-70). Any sequences that then result from the reflex process on that particular molecule (e.g., as described above) will thus be labeled with the same unique MID tag and thus inherently linked. Note that once all molecules in a sample are individually tagged, they can be manipulated and amplified as much as needed for processing so long as the MID tag is maintained in the products generated.
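The relationship between the size of the MID pool and the fraction of uniquely tagged molecules can be estimated with a short calculation. The sketch below assumes, purely for illustration, that each molecule receives one tag drawn uniformly at random and independently from the pool (sampling with replacement); the function name and the example pool sizes are hypothetical.

```python
def expected_unique_fraction(n_molecules: int, n_tags: int) -> float:
    """Expected fraction of molecules whose tag is shared with no other molecule.

    Assumes each molecule draws one tag uniformly at random, with replacement,
    from a pool of n_tags distinct tags. A given molecule is uniquely tagged
    when none of the other (n_molecules - 1) molecules drew the same tag.
    """
    if n_molecules < 1 or n_tags < 1:
        raise ValueError("n_molecules and n_tags must be positive")
    return (1.0 - 1.0 / n_tags) ** (n_molecules - 1)


# Compare several hypothetical pool sizes for a sample of 1,000 molecules.
for n_tags in (1_000, 100_000, 10_000_000):
    fraction = expected_unique_fraction(1_000, n_tags)
    print(f"{n_tags:>10} tags -> {fraction:.4f} uniquely tagged")
```

Under this assumption, a pool no larger than the number of molecules leaves most molecules sharing a tag, whereas a pool many times larger leaves nearly all molecules uniquely tagged, consistent with the guidance above.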

For example, we might want to sequence one thousand viral genomes (or a specific genomic region) or one thousand copies of a gene present in somatic cells. After tagging each polynucleotide in the sample with a sequencing primer site, MID and reflex sequence (as shown in the figures and described above), we use the reflex process to break each polynucleotide into lengths appropriate to the sequencing procedure being used, transferring the sequencing primer site and MID to each fragment (as described above). Obtaining sequence information from all of the reflex-processed samples can be used to determine the sequence of each individual polynucleotide in the starting sample, using the MID sequence to define linkage relationships between sequences from different regions in the polynucleotide being sequenced. Using a sequencing platform with longer read lengths can minimize the number of primers to be used (and reflex fragments generated).
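As a minimal sketch of the bioinformatic step implied above, reads from different reflex fragments can be grouped by their MID so that calls from distant regions of the same starting molecule are reported together. The read tuples, region names and allele labels below are hypothetical, and the sketch assumes that the MID and region have already been extracted from each read.

```python
from collections import defaultdict


def link_by_mid(reads):
    """Group per-region calls by MID.

    reads: iterable of (mid, region, call) tuples.
    Returns {mid: {region: call}}; if a MID/region pair occurs more than once,
    the last call seen is kept (a simplification for this sketch).
    """
    molecules = defaultdict(dict)
    for mid, region, call in reads:
        molecules[mid][region] = call
    return {mid: dict(sorted(calls.items())) for mid, calls in molecules.items()}


# Hypothetical reads from two tagged molecules, each processed at three regions.
reads = [
    ("MID_0001", "A", "A1"),
    ("MID_0002", "C", "C2"),
    ("MID_0001", "C", "C1"),
    ("MID_0002", "A", "A1"),
    ("MID_0001", "B", "B1"),
    ("MID_0002", "B", "B1"),
]

for mid, calls in sorted(link_by_mid(reads).items()):
    print(mid, calls)
```

Because each group is keyed by a single MID, the calls within a group derive from one starting molecule, which is the linkage information that fragment-based sequencing without such tags does not retain.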

The advantages noted above are shown in FIG. 5. This figure shows a comparison of methods for identifying nucleic acid polymorphisms in homologous nucleic acids in a sample (e.g., the same region derived from a chromosomal pair of a diploid cell or viral genomes/transcripts). The top schematic shows two nucleic acid molecules in a sample (1 and 2) having a different assortment of polymorphisms in polymorphic sites A, B and C (A1, B1, C1 and C2). Standard sequencing methods using fragmentation (left side) can identify the polymorphisms in these nucleic acids but do not retain linkage information. Employing the reflex process described herein to identify polymorphisms (right side) maintains linkage information. It is noted that not all domain structures and steps are shown in the reflex process for simplicity.

Kits and Systems

Also provided by the subject invention are kits and systems for practicing the subject methods, as described above, such as vectors configured to add reflex sequences to nucleic acid inserts of interest and reagents for performing any steps in the cloning or reflex process described herein (e.g., restriction enzymes, nucleotides, polymerases, primers, exonucleases, etc.). The various components of the kits may be present in separate containers or certain compatible components may be precombined into a single container, as desired.

The subject systems and kits may also include one or more other reagents for preparing or processing a nucleic acid sample according to the subject methods. The reagents may include one or more matrices, solvents, sample preparation reagents, buffers, desalting reagents, enzymatic reagents, denaturing reagents, where calibration standards such as positive and negative controls may be provided as well. As such, the kits may include one or more containers such as vials or bottles, with each container containing a separate component for carrying out a sample processing or preparing step and/or for carrying out one or more steps of a nucleic acid variant isolation assay according to the present invention.

In addition to the above-mentioned components, the subject kits typically further include instructions for using the components of the kit to practice the subject methods, e.g., to prepare nucleic acid samples for performing the reflex process according to aspects of the subject methods. The instructions for practicing the subject methods are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or sub-packaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

In addition to the subject database, programming and instructions, the kits may also include one or more control samples and reagents, e.g., two or more control samples for use in testing the kit.

Utility

The reflex process described herein provides significant advantages in numerous applications, a few of which are noted below (as well as described above).

For example, as described above, certain aspects of the reflex process define the particular sections of sequence to be analyzed by a primer pair, as in PCR (e.g., the two oligos shown as A_(n)-B_(n) in FIG. 1B). This results in higher sequence specificity as compared to other extraction methods (e.g., using probes on a microarray) that only use a single oligo sequence. The separation of the probes defines a length that can be relatively uniform (hence making subsequent handling including amplification more uniform) and can also be tailored to the particular sequencing platform being employed.

Further, as described above, aspects of the present invention can be used to analyze homologous genomic locations in a multiplexed sample (i.e., a sample having polynucleotides from different genomic samples) in which the polynucleotides are tagged with the MID. This is possible because the reflex process, which operates intramolecularly, maintains the MID, thus linking any particular fragment to the sample from which it originates.

Finally, as the reflex processes described herein function intramolecularly, one can determine the genetic linkage between different regions on the same large fragment that are too far apart to be sequenced in one sequence read. Such a determination of linkage may be of great value in plant or animal genetics (e.g., to decide if a particular set of variations are linked together on the same stretch of chromosome) or in viral studies (e.g., to determine if particular variations are linked together on the same stretch of a viral genome/transcripts, e.g., HIV, hepatitis virus, etc.).

EXAMPLES

Example I

FIGS. 6 and 7 provide experimental data and validation of the reflex process described herein using synthetic polynucleotide substrates.

Methods

Substrate:

The 100 base oligonucleotide substrate (as shown diagrammatically in FIG. 6A) was synthesized with internal fluorescein-dT positioned between the REFLEX and REFLEX′ sequences. This label provides a convenient and sensitive method of detection of oligonucleotide species using polyacrylamide gel electrophoresis.

Extension Reactions:

Reactions were prepared which contained 1 μM of the 100 base oligonucleotide substrate, 200 μM dNTPs, presence or absence of 1 μM competitor oligonucleotide, 0.5 μl of each DNA polymerase (“DNAP”): Vent (NEB, 2 units/μl), Taq (Qiagen HotStarTaq 5 units/μl) and Herculase (Stratagene), and made up to 50 μl with the appropriate commercial buffers for each polymerase and dH₂O. For Taq titrations 0.5 μl, 1 μl, 2 μl, and 3 μl enzyme was used in the same 50 μl volume. Reactions were heated in a Biometra thermocycler to 95° C. for 15 minutes (Taq) or 5 minutes (Herculase, Vent), followed by 55° C. or 50° C. for 30 seconds, and a final incubation at 72° C. for 10 minutes.

T7 Exonuclease Digestions:

Reactions were prepared with 10 μl of the extension reactions above, 0.5 μl T7 exonuclease (NEB, 10 units/μl), and made up to 50 μl using NEB Buffer 4 and dH₂O. Reactions were incubated at 25° C. for 30 minutes.

Gel Electrophoresis Analysis:

An 8% denaturing polyacrylamide gel was used to analyze reaction species. 0.4 μl of extension reactions and 2 μl of digestion reactions were loaded and run at 800V for ˜1.5 hours. Gels were analyzed for fluorescein using an Amersham/General Electric Typhoon imager.

Results

FIG. 6A shows the structure of each stage of reflex sequence processing with the expected nucleic acid size shown on the left. The initial single stranded nucleic acid having a sequencing primer site (the Roche 454 sequencing primer A site; listed as 454A); an MID; a reflex sequence; the insert; and a complement of the reflex sequence is 100 nucleotides in length. After self-annealing and extension, the product is expected to be 130 nucleotides in length. After removal of the double stranded region from the 5′ end, the nucleic acid is expected to be 82 bases in length.
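The three expected sizes can be checked against one another with simple arithmetic. In the sketch below, only the values 100, 130 and 82 come from the description; the domain lengths are inferred from them under two stated assumptions (that extension copies exactly the 454A site and MID, and that the exonuclease removes exactly the 454A site, MID and reflex sequence), and are not values given in the text.

```python
# Sizes stated in the description (nucleotides).
SUBSTRATE_NT = 100   # initial single stranded substrate
EXTENDED_NT = 130    # after self-annealing and extension
DIGESTED_NT = 82     # after removal of the 5' double stranded region

# Assumption 1: extension copies only the region 5' of the reflex sequence
# (454A site + MID), so the added length equals that region's length.
copied_nt = EXTENDED_NT - SUBSTRATE_NT

# Assumption 2: digestion removes the whole 5' duplex (454A site + MID + reflex).
removed_nt = EXTENDED_NT - DIGESTED_NT

# Inferred, not stated in the text: lengths implied by the two assumptions.
implied_reflex_nt = removed_nt - copied_nt
implied_insert_nt = SUBSTRATE_NT - copied_nt - 2 * implied_reflex_nt

print("454A + MID:", copied_nt)
print("reflex sequence:", implied_reflex_nt)
print("insert:", implied_insert_nt)

# Consistency check: the inferred domains must rebuild all three stated sizes.
assert copied_nt + 2 * implied_reflex_nt + implied_insert_nt == SUBSTRATE_NT
assert SUBSTRATE_NT + copied_nt == EXTENDED_NT
assert implied_insert_nt + implied_reflex_nt + copied_nt == DIGESTED_NT
```

The inferred lengths serve only to show that the three stated sizes are mutually consistent with the fold-back model; the actual domain lengths of the substrate would need to be taken from FIG. 6A.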

FIG. 6B shows the results of three experiments using three different nucleic acid polymerases (Vent, Herculase and Taq, indicated at the top of the lanes). The temperature at which the annealing was carried out is shown at the top of each lane (either 50° C. or 55° C.). The sizes of the three nucleic acids as noted above are indicated on the left and right side of the gel.

As shown in FIG. 6B, extension appears to be most efficient under the conditions used with Herculase (Herculase is a mixture of two enzymes: modified Pfu DNAP and Archaemax (dUTPase)). Most (or all) of the initial 100 base nucleic acid is converted to the 130 base product (see lanes 6 and 7). However, after T7 exonuclease digestion, the 3′-5′ exonuclease activity of Herculase results in partial digestion of the desired 82 base product (note bands at and below 82 bases in lanes 8 and 9).

Taq, which lacks 3′-5′ exonuclease activity, shows a stronger band at the expected size of the final product after T7 exonuclease digestion (see lane 13).

FIG. 7 shows the effect on the reflex process of increasing amounts of Taq polymerase as well as the use of a reflex sequence competitor (schematically shown in FIG. 7A).

As shown in lanes 2 to 5, increased Taq concentration improves extension to ˜90% conversion of the starting nucleic acid (see lane 5). Lanes 7 to 8 show that T7 exonuclease digestion does not leave a perfect 82 base product. This may be due to collapse of dsDNA when T7 exonuclease has nearly completed its digestion from the 5′ end in the double stranded region of the fold-back structure. It is noted that in many embodiments, the removal of a few additional bases from the 5′ end of the polynucleotide will not interfere with subsequent analyses, as nucleotide bases at the 5′ end are often removed during subsequent steps.

As shown in lanes 11-14, addition of a competitor (which can interfere with annealing of the reflex sequences to form the fold-back structure) results in only a small decrease (˜5-10%) of fully extended product. Thus, as expected, the intramolecular reaction is heavily favored. Although not shown, we have observed that the competitor oligonucleotide also gets extended by the same amount (˜5-10%).

The concentration of the competitor, the concentration of the reflex substrate, and the overall genetic complexity will all likely affect specific results. The experiments shown in FIGS. 6 and 7 demonstrate that the core parts of the reflex processes as described herein are functional and can be implemented.

Example II

FIG. 8 shows the reflex workflow (diagram at left) and exemplary results of the workflow (gel at right) for a specific region of interest (ROI). The starting material is a double stranded nucleic acid molecule (700) that contains a 454A primer site, an MID, a reflex site, and a polynucleotide of interest having three ROIs (2, 3 and 4) at different locations therein. This starting material was subjected to reflex processes (as described above) specific for ROI 2 as shown in the diagram at the left of the figure, both with and without the use of a T7 exonuclease step (the T7 exonuclease step shown in the diagram is indicated as “Optional”).

Completion of all steps shown in the reflex process should result in a double stranded polynucleotide of 488 base pairs (702) with or without the T7 exonuclease step.

As shown in the gel on the right of FIG. 8, the 488 base pair product was produced in reflex processes with and without the T7 exonuclease step.

FIG. 9 shows an exemplary protocol for a reflex process based on the results discussed above. The diagram shows specific reflex process steps with indications on the right as to where purification of reaction products is employed (e.g., using Agencourt SPRI beads to remove primer oligos). One reason for performing such purification steps is to reduce the potential for generating side products in a reaction (e.g., undesirable amplicons). While FIG. 9 indicates three purification steps, fewer or additional purification steps may be employed depending on the desires of the user. It is noted that the steps of reversing polarity, reflex priming and extension, and “stretch out” (or denaturation)/second reversing polarity step can be performed without intervening purification steps.

The protocol shown in FIG. 9 includes the following steps:

annealing a first primer containing a 5′ reflex sequence (or reflex tail, as noted in the figure) specific for the 3′ primer site for the R′ region to the starting polynucleotide and extending (the primer anneals to the top strand at the primer site at the right of R in polynucleotide 902, indicated with a *; this step represents the first denature, anneal and extend process indicated on the right);

after purification, adding a 454A primer and performing three cycles of denaturing, annealing and extending: the first cycle results in the copy-back from the 454A primer to reverse the polarity of the strand just synthesized; the second cycle breaks apart the double stranded structure produced, allows the reflex structure to form and then extend; the third cycle results in another copy-back using the same 454A primer originally added;

after purification, adding a second primer specific for the second primer site for the R′ region having a 5′ 454B tail (this primer anneals to the primer site 3′ of the R′ region in polynucleotide 904, indicated with a *) and denaturing, annealing and extending, resulting in a polynucleotide product having 454A and 454B sites surrounding the MID, the reflex sequence, and R′ (note that the first primer specific for the R′ region and the second primer specific for the R′ region define its boundaries, as described above and depicted in FIG. 1B);

after another purification, adding 454A and 454B primers and performing a PCR amplification reaction.

Example III

As described above, a reflex sequence can be an “artificial” sequence added to a polynucleotide as part of an adapter or can be based on a sequence present in the polynucleotide of interest being analyzed, e.g., a genomic sequence (or “non-artificial”).

The data shown in prior Examples used “artificial” reflex sites. In this Example, the reflex site is a genomic sequence present in the polynucleotide being analyzed.

The starting material is a double stranded DNA containing a 454A site, an MID and a polynucleotide to be analyzed. The 454A and MID were added by adapter ligation to parent polynucleotide fragments followed by enrichment of the polynucleotide to be analyzed by a hybridization-based pull-out reaction and subsequent secondary PCR amplification (see Route 1 in FIG. 13). Thus, the reflex site employed in this example is a sequence normally present at the 5′ end of the subject polynucleotide (a genomic sequence). The polynucleotide being analyzed includes a region of interest distal to the 454A and MID sequences that is 354 base pairs in length.

This starting double stranded nucleic acid is 755 base pairs in length. Based on the length of each of the relevant domains in this starting nucleic acid, the reflex process should result in a product of 461 base pairs.

FIG. 10 shows the starting material for the reflex process (left panel) and the resultant product generated using the reflex process (right panel; reflex process was performed as described in Example II, without using a T7 exonuclease step). A size ladder is included in the left hand lane of each gel to allow estimation of the size of the test material. This figure shows that the 755 base pair starting nucleic acid was processed to the expected 461 base pair product, thus confirming that a “non-artificial” reflex site is effective in moving an adapter domain from one location to another in a polynucleotide of interest in a sequence specific manner.

Example IV

FIG. 11 shows a schematic of an experiment in which the reflex process is performed on a single large initial template (a “parent” fragment) to generate 5 different products (“daughter” products) each having a different region of interest (i.e., daughter products are produced having either region 1, 2, 3, 4 or 5). The schematic in FIG. 11 shows the starting fragment (11,060 base pairs) and resulting products (each 488 base pairs) generated from each of the different region of interest-specific reflex reactions (reflex reactions are performed as described above). The panel (gel) on the bottom of FIG. 11 shows the larger starting fragment (Lane 1) and the resulting daughter products for each region-specific reflex reaction (lanes 2 to 6, with the region of interest noted in each in the box), where the starting and daughter fragments have the expected lengths. Sequencing of the products confirmed the identity of the region of interest in each of the reflex products shown in the gel. These results demonstrate that multiple different reflex products can be generated from a single, asymmetrically tagged parent fragment while maintaining the adapter domains (e.g., the primer sites and MID).

Example V

FIG. 12 details experiments performed to determine the prevalence of intramolecular rearrangement (as desired in the reflex process) vs. intermolecular rearrangement. Intermolecular rearrangement is undesirable because it can lead to the transfer of an MID from one fragment to another (also called MID switching). MID switching can occur if a reflex sequence in a first fragment hybridizes to its complement in a second fragment during the reflex process, leading to appending the MID from the second fragment to the first fragment. Thus, intermolecular rearrangement, or MID switching, should be minimized to prevent the transfer of an MID from one fragment in the sample to another, which could lead to a misrepresentation of the source of a fragment.

To measure the prevalence of MID switching under different reflex conditions, fragments having different sizes were generated that included two different MIDs, as shown in the top panel of FIG. 12. The common sequence on these fragments serves as the priming site for the first extension reaction to add the second reflex sequence (see, e.g., step 2 of FIG. 3). Three exemplary fragments are shown in FIG. 12 for each different fragment size (i.e., 800 base pairs with an MIDB and MIDA combination; 1900 base pairs with an MIDC and MIDA combination; and 3000 base pairs with an MIDD and MIDA combination). For each MID family (A, B, C and D), there are 10 different members (i.e., MIDA has 10 different members, MIDB has 10 different members, etc.). A set of 10 dual MID fragments for each different size fragment (i.e., 800, 1900 and 3000 base pairs) was generated, where the MID pairs (i.e., MIDA/MIDB, MIDA/MIDC, and MIDA/MIDD) were designated as 1/1, 2/2, 3/3, 4/4, 5/5, 6/6, 7/7, 8/8, 9/9, and 10/10. All 10 fragments of the same size were then mixed together and a reflex protocol was performed.

Due to the domain structure of the fragments, a successful reflex process results in the two MIDs for each fragment being moved to within close enough proximity to be sequenced in a single read using the Roche 454 sequencing platform (see the reflex products shown in the schematic of FIG. 12). The reflex reactions for each fragment size were performed at four different fragment concentrations to determine the effect of this parameter, as well as fragment length, on the prevalence of MID switching. The reflex products from each reaction performed were subjected to 454 sequencing to determine the identity of both MIDs on each fragment, and thereby the proportion of MID switching that occurred.

The panel on the bottom left of FIG. 12 shows the rate of MID switching (Y axis, shown in % incorrect (or switched) MID pair) for each different length fragment at each different concentration (X axis; 300, 30, 3 and 0.3 nM). As shown in this panel, the MID switch rate decreases with lower concentrations, as would be expected, because intermolecular, as opposed to intramolecular, binding events are concentration dependent (i.e., lower concentrations lead to reduced intermolecular hybridization/binding). In addition, the MID switch rate decreases slightly with length. This is somewhat unexpected as the ends of longer DNA fragments are effectively at a lower concentration with respect to one another. The reason we do not see this is probably that the production of reflex priming intermediates continues during the final PCR, which means that reflex priming reactions are happening continuously, which contributes to MID switching. It is probably the case that the shorter reflex products are able to undergo a higher rate of ‘background’ reflexing, and therefore increase the overall MID switch rate a little.

These results demonstrate that MID switching can be minimized (e.g., to below 2%, below 1% or even to nearly undetectable levels) by altering certain parameters of the reaction, e.g., by reducing fragment concentration and/or fragment length.

The panel on the bottom right of FIG. 12 shows the frequency of MID switching in the reflex process for the 800 base pair fragments (i.e., MIDA/MIDB containing fragments). In this figure, the area of each circle is proportional to the number of reads containing the corresponding MIDA and MIDB species (e.g., MIDA1/MIDB1; MIDA1/MIDB2; etc.). Thus, a circle representing 200 reads will be 40 times larger in terms of area than a circle representing 5 reads.
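The area scaling described for this panel can be illustrated with a brief calculation. The following is a minimal sketch, not the plotting method actually used to prepare FIG. 12; the function name and the `area_per_read` parameter are hypothetical. Because area, not radius, is proportional to read count, the radius grows with the square root of the number of reads.

```python
import math


def circle_radius(reads, area_per_read=1.0):
    """Radius of a circle whose AREA is proportional to the read count.

    area_per_read is a hypothetical scale factor (plot units per read).
    """
    if reads < 0:
        raise ValueError("read count cannot be negative")
    area = reads * area_per_read
    return math.sqrt(area / math.pi)


r_200 = circle_radius(200)
r_5 = circle_radius(5)

# The ratio of areas equals the ratio of read counts (200 / 5 = 40),
# while the ratio of radii is only sqrt(40).
area_ratio = (r_200 / r_5) ** 2
radius_ratio = r_200 / r_5
print(f"area ratio:   {area_ratio:.1f}")
print(f"radius ratio: {radius_ratio:.2f}")
```

A circle for 200 reads therefore has 40 times the area, but only about 6.3 times the radius, of a circle for 5 reads.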

As noted above, the MIDA/MIDB combinations having the same number (shown on the X and Y axes, respectively) represent the MIDA/MIDB combinations present in the sample prior to the reflex process being performed (i.e., MIDA/MIDB combinations 1/1, 2/2, 3/3, 4/4, 5/5, 6/6, 7/7, 8/8, 9/9, and 10/10 were present in the starting sample). All other MIDA/MIDB combinations identified by Roche 454 sequencing were the result of MIDswitching.

This figure shows that the MID switching that occurs during the reflex process is random (i.e., that MID switching is not skewed based on the identity of the MIDs in the reaction).
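The classification of reads in this Example (matching MID numbers indicate an original pair; any other combination indicates MID switching) can be sketched as a simple tally. The code below is an illustrative sketch only: the function name is hypothetical, the read data are invented for demonstration and are not results from FIG. 12, and each read is assumed to have already been decoded into a pair of MID member numbers.

```python
from collections import Counter


def mid_switch_rate(read_pairs):
    """Return (fraction of reads with a switched MID pair, counts per pair).

    read_pairs: iterable of (mida_member, midb_member) tuples, one per read.
    A pair is "original" when both member numbers match (1/1, 2/2, ...)
    and "switched" otherwise.
    """
    counts = Counter(read_pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    switched = sum(n for (a, b), n in counts.items() if a != b)
    return switched / total, counts


# Hypothetical reads: 98 original 1/1 pairs plus two switched pairs.
example_reads = [(1, 1)] * 98 + [(1, 2), (3, 7)]
rate, pair_counts = mid_switch_rate(example_reads)
print(f"MID switch rate: {rate:.1%}")  # 2 switched reads out of 100

# Off-diagonal counts can be inspected for skew toward particular MIDs.
off_diagonal = {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}
print(off_diagonal)
```

The per-pair counts returned by the function correspond to the quantities represented by circle area in the bottom right panel of FIG. 12.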

Exemplary Reflex Protocols

FIG. 13 shows a diagram of exemplary protocols for performing the reflex process on pools of nucleic acids, for example, pools of nucleic acids from different individuals, each of which is labeled with a unique MID. In Route 3, a pooled and tagged extended library is subjected directly to a reflex process. In Route 2, the pooled library is enriched by target-specific hybridization followed by performing the reflex process. Route 1 employs enrichment by PCR amplification. As shown in FIG. 13, PCR enrichment can be performed directly on the pooled tagged extended library or in a secondary PCR reaction after a hybridization-based enrichment step has been performed (as in Route 2) to generate an amplicon substrate that is suitable for the reflex process. Additional routes for preparing a polynucleotide sample for performing a reflex process can be implemented (e.g., having additional amplification, purification, and/or enrichment steps), which will generally be dependent on the desires of the user.

Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it is readily apparent to those of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the appended claims.

Accordingly, the preceding merely illustrates the principles of the invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein. Rather, the scope and spirit of the present invention is embodied by the appended claims.

What is claimed:
1. A method for biological analysis, the method comprising: (a) providing a plurality of polynucleotides derived from a tissue sample; (b) generating a plurality of tagged polynucleotides from said plurality of polynucleotides and a plurality of oligonucleotide tags, wherein a tagged polynucleotide of the plurality of tagged polynucleotides comprises: (i) a first tag sequence corresponding to a site within said tissue sample; and (ii) a second tag sequence distinguishing said tagged polynucleotide from other tagged polynucleotides of the plurality of tagged polynucleotides; (c) sequencing said plurality of tagged polynucleotides to determine sequences corresponding to the plurality of polynucleotides, including the first tag sequence and the second tag sequence; and (d) using the sequences determined in step (c) to count polynucleotides for multiple different polynucleotides derived from said tissue sample.